Now that the hardware and software environment of the Cray-2 is more stable, some 
comparisons can be made between the performance of the Cray-2 and that of other super- 
computers at Ames, notably the Cray X-MP. The comparisons below were made on a suite 
of thirteen floating-point intensive programs typical of codes expected to run on the NAS 
High Speed Processor. There are four sets of figures for each program: 

• Cray-2 Stand-Alone: The performance of the Cray-2 running on a single CPU with 
other CPUs idle. 

• Cray-2 Simultaneous: The average performance of the four Cray-2 CPUs
simultaneously running the same program. 

• Cray-2 Normal: The performance of a single CPU with a typical daytime background 
of jobs running in the other three CPUs. 

• Cray X-MP Normal: The performance of the Cray X-MP/12 with a normal amount 
of swapping with other jobs. 

Some of these programs have also been run on the Cray X-MP/48. The run times 
in each case were very close to the Cray X-MP/12 run times. This is largely due to the 
fast memory in the Cray X-MP computers, which minimizes the effect of memory bank 
contention in the four processor model. As a result, the Cray X-MP/12 performance 
figures can be considered to be highly accurate estimations of the performance of the Cray 
X-MP/48 on these programs. 

The numbers shown in the table below are MFLOPS, computed using a Cray timing 
routine. The floating-point operation counts were obtained using the hardware performance 
monitor on the Cray X-MP/12. It is assumed here that the number of floating-point 
operations performed on the Cray-2 is the same as on the Cray X-MP/12, although there 
may be small differences. 
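
As a minimal sketch of how such a rate is formed, the following Python fragment divides an operation count by an elapsed time. The function name and the sample numbers are hypothetical and are not taken from the measurements; the only point illustrated is that one operation count (here imagined as coming from the X-MP hardware monitor) is reused for both machines, so only the timings differ.

```python
def mflops(flop_count: int, elapsed_seconds: float) -> float:
    """Return millions of floating-point operations per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return flop_count / elapsed_seconds / 1.0e6


# Hypothetical numbers: a single operation count is assumed to apply to
# both machines, as in the text; only the run times are machine-specific.
flops = 3_000_000_000
xmp_rate = mflops(flops, 60.0)    # hypothetical X-MP run time in seconds
cray2_rate = mflops(flops, 80.0)  # hypothetical Cray-2 run time in seconds

print(f"X-MP:   {xmp_rate:.2f} MFLOPS")
print(f"Cray-2: {cray2_rate:.2f} MFLOPS")
```

If the true Cray-2 operation count differs slightly from the X-MP count, the Cray-2 rate computed this way is off by the same small factor.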

Most of these codes are actual NASA Ames user codes, although some are not current 
production versions. Those that are not actual user codes include LLOOPS (the Livermore 
loops), MATEST (performs the matrix inversion technique developed by Ferguson and one 
of the authors), NASKERN (the NAS Kernel Benchmark Program with some tuning), 
and PITEST (a computational computer test program). These tests were run using the 
most recent versions of the Fortran compilers available on each machine: CFT 2.63 on the 
Cray-2 and CFT 1.14 BF3 on the Cray X-MP. 

The figures in the column headed Ratio are the normal load MFLOPS figures of the 
Cray-2 divided by the normal load figures of the Cray X-MP/12. These rates are considered 
to be the most realistic system performance measures. Note, however, that the stand-alone 
figures for the Cray-2 on several programs, notably ARC3, LES, and MATEST, are 
considerably higher than the normal results. 

Program        Cray-2        Cray-2    Cray-2   Cray X-MP      Ratio
Name      Stand-Alone  Simultaneous    Normal      Normal  (Percent)

ARC3            42.91         26.98     30.11       51.35      58.64
ATRANS          12.68         10.81     10.43       21.10      49.43
BL3D            44.98         37.65     37.81       51.10      73.98
DERTRA          18.51         15.63     15.78       20.97      75.23
F3D             32.51         24.70     26.46       33.71      78.51
INS3D           54.55         38.93     41.35       52.75      78.39
LES             90.36         53.21     55.34       83.37      66.38
LLOOPS           9.58          9.28      9.01       14.89      60.50
MATEST         394.55        231.01    244.31      192.48     126.93
NASKERN         94.17         53.72     57.28       91.21      62.80
PITEST         165.05        161.16    146.52      131.20     111.68
PNS3D            5.76          5.24      5.04       10.77      46.83
SUNSX            3.99          3.75      3.57        9.56      37.33

AVERAGE                                                        71.28

Table 1: Cray-2 and Cray X-MP Performance (MFLOPS) 
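
The Ratio and AVERAGE entries of Table 1 can be re-derived from the two normal-load columns. The Python sketch below is illustrative only: the names are ours, and the data are the rounded figures from the table, so recomputed ratios can differ from the listed ones in the second decimal place. The listed AVERAGE appears to be the arithmetic mean of the thirteen listed ratios, which the sketch checks.

```python
# name: (Cray-2 normal MFLOPS, Cray X-MP normal MFLOPS, listed ratio in percent)
TABLE = {
    "ARC3": (30.11, 51.35, 58.64),
    "ATRANS": (10.43, 21.10, 49.43),
    "BL3D": (37.81, 51.10, 73.98),
    "DERTRA": (15.78, 20.97, 75.23),
    "F3D": (26.46, 33.71, 78.51),
    "INS3D": (41.35, 52.75, 78.39),
    "LES": (55.34, 83.37, 66.38),
    "LLOOPS": (9.01, 14.89, 60.50),
    "MATEST": (244.31, 192.48, 126.93),
    "NASKERN": (57.28, 91.21, 62.80),
    "PITEST": (146.52, 131.20, 111.68),
    "PNS3D": (5.04, 10.77, 46.83),
    "SUNSX": (3.57, 9.56, 37.33),
}


def ratio_percent(cray2: float, xmp: float) -> float:
    """Cray-2 rate as a percentage of the Cray X-MP rate."""
    return 100.0 * cray2 / xmp


# Ratios recomputed from the rounded MFLOPS columns.
recomputed = {name: ratio_percent(c2, xmp) for name, (c2, xmp, _) in TABLE.items()}

# Arithmetic mean of the ratios as listed in the table.
mean_listed = sum(listed for _, _, listed in TABLE.values()) / len(TABLE)

# Programs on which the Cray-2 is the faster machine under normal load.
faster_on_cray2 = sorted(name for name, r in recomputed.items() if r > 100.0)

for name, (_, _, listed) in TABLE.items():
    print(f"{name:8s} recomputed {recomputed[name]:7.2f}   listed {listed:7.2f}")
print(f"mean of listed ratios: {mean_listed:.2f}")
print("Cray-2 faster on:", ", ".join(faster_on_cray2))
```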

Several conclusions may be immediately drawn from the above data. First of all, the 
MFLOPS performance figures vary dramatically from program to program. This variance 
depends most strongly on the degree of vectorization. The performance ratio of the two 
machines also depends highly on the degree of vectorization. The Cray-2 out-performs 
the X-MP on some highly-vectorized programs, but on scalar codes the slower memory 
of the Cray-2 results in performance rates sharply lower than the X-MP. The average 
performance ratio listed in the table above indicates that users should expect about 30% 
slower performance on the Cray-2 for a program previously running on the Cray X-MP. 
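
The dependence on vectorization can be illustrated with a simple two-rate model in which a fraction f of the floating-point operations runs at a vector rate and the remainder at a scalar rate. This is only a sketch under assumed numbers: the rates given to "machine A" and "machine B" below are hypothetical and are not measurements of either Cray. The model itself is the usual harmonic combination of the two rates.

```python
def effective_rate(vector_fraction: float, scalar_rate: float, vector_rate: float) -> float:
    """Overall rate when vector_fraction of the operations run at vector_rate
    and the rest at scalar_rate (rates in MFLOPS)."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must lie in [0, 1]")
    time_per_op = vector_fraction / vector_rate + (1.0 - vector_fraction) / scalar_rate
    return 1.0 / time_per_op


# Hypothetical machines: A has the faster vector unit, B the faster scalar rate.
MACHINE_A = {"scalar_rate": 5.0, "vector_rate": 200.0}
MACHINE_B = {"scalar_rate": 10.0, "vector_rate": 150.0}

for f in (0.0, 0.5, 0.9, 0.99, 1.0):
    a = effective_rate(f, **MACHINE_A)
    b = effective_rate(f, **MACHINE_B)
    print(f"f = {f:4.2f}   A: {a:7.2f}   B: {b:7.2f}   A/B: {100.0 * a / b:6.1f}%")
```

In this model the machine with the slower scalar rate wins only when nearly all operations vectorize, which is the qualitative pattern seen in the table.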

Based on these results and some other comparisons, the types of codes that will likely 
perform well on the Cray-2 compared to the Cray X-MP include the following: 

• Library subroutine intensive codes (i.e. those codes that can utilize assembly-coded 
library subroutines to perform a significant part of their computation). 

• Register-intensive codes (i.e. those that perform only a few main memory stores and 
fetches for a given amount of computation). 

• Very small memory codes or other codes that can effectively utilize local memory. 

The types of codes that will likely not perform well in comparison to the X-MP include 
the following: 

• Scalar codes (because such codes spend most of their time fetching from and storing 
to the slow main memory). 

• Partially vectorized codes (again because of the scalar disadvantage). 

• Main memory intensive codes (i.e. those that perform numerous main memory stores 
and fetches for a given amount of computation). 

• Codes with power-of-two memory strides in major loops. 
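
The last item can be made concrete with a simple model of interleaved memory. The Python sketch below assumes that word address a resides in bank a mod B and that B is a power of two; the bank count used (128) is an illustrative assumption, not a figure from this report. Under that model, a loop with stride s touches only B / gcd(s, B) distinct banks, so power-of-two strides concentrate all references on a few banks.

```python
from math import gcd


def distinct_banks(stride: int, n_banks: int) -> int:
    """Number of different banks touched by a constant-stride access stream,
    assuming word address a lives in bank a % n_banks."""
    if stride <= 0 or n_banks <= 0:
        raise ValueError("stride and n_banks must be positive")
    return n_banks // gcd(stride, n_banks)


N_BANKS = 128  # illustrative assumption for the sketch

for stride in (1, 2, 3, 64, 127, 128, 129):
    print(f"stride {stride:3d}: {distinct_banks(stride, N_BANKS):3d} banks in use")
```

In this model, padding an array dimension from 128 to 129 words changes a stride-128 loop from one bank to all 128 banks.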

The slower observed performance of the Cray-2 is mitigated by the following two factors. 
First of all, CFT 1.14 on the Cray X-MP is significantly more mature than the Cray-2 
compiler. Not only is it more reliable, but its vectorization analysis is more advanced, and 
it has been highly tuned for the X-MP. Secondly, these test codes are mostly Cray X-MP 
codes written for that machine. Some of these codes are even derivatives of old CDC 
7600 codes. Only two of them were written specifically for the Cray-2, and the Cray-2 
out-performed the Cray X-MP in these cases. 

The two codes MATEST and PITEST were included in this list to demonstrate that 
the Cray-2 is capable of truly astonishing performance on complete application codes. In 
each case these codes were written from the ground up specifically to run on the Cray-2. 
Care was taken to code all major loops in a style that would permit full vectorization with 
long vector lengths and a minimum of main memory activity. In the case of MATEST, 
the optimized library subroutine MXM was utilized to multiply matrices, which represents 
a large part of the computation. As a result, the performance rates of these programs 
were significantly higher than the others on the list, and in each case the Cray-2 ran the 
program faster than the Cray X-MP. 

Realistically, however, the Cray-2 should be expected to run about 30% slower than 
the X-MP on most Fortran codes, given the same amount of effort in optimization on each 
machine. Some improvements in the Cray-2 performance can be expected as the Fortran 
compiler matures, but it is not likely that many codes will run substantially faster as a 
result. Some codes will run faster as users increase the array sizes of their codes to take 
advantage of the larger main memory on the Cray-2. However, other codes already employ 
reasonably long vector lengths and will not run significantly faster with longer loop lengths. 

Utilization of local memory may help in some cases, but none of the CFD codes cur- 
rently in use seem to be able to make good use of it. The dramatic difference in performance 
between stand-alone and simultaneous runs in some of the cases above indicates that much 
of the Cray-2’s speed disadvantage may be due to memory bank contention, a problem cur- 
rently exacerbated by the disabling of pseudo-banking. If this is the case, the performance 
of the Cray-2 would sharply improve if it could be retrofitted with static RAM chips in 
main memory. 
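
The size of the stand-alone versus simultaneous difference can be tabulated directly from Table 1. The Python sketch below uses the two Cray-2 columns from the table; the function name is ours. It only measures the drop in rate. Attributing that drop to memory bank contention remains the hypothesis stated above, not something the calculation demonstrates.

```python
# name: (Cray-2 stand-alone MFLOPS, Cray-2 simultaneous MFLOPS), from Table 1
CRAY2 = {
    "ARC3": (42.91, 26.98),
    "ATRANS": (12.68, 10.81),
    "BL3D": (44.98, 37.65),
    "DERTRA": (18.51, 15.63),
    "F3D": (32.51, 24.70),
    "INS3D": (54.55, 38.93),
    "LES": (90.36, 53.21),
    "LLOOPS": (9.58, 9.28),
    "MATEST": (394.55, 231.01),
    "NASKERN": (94.17, 53.72),
    "PITEST": (165.05, 161.16),
    "PNS3D": (5.76, 5.24),
    "SUNSX": (3.99, 3.75),
}


def slowdown_percent(stand_alone: float, simultaneous: float) -> float:
    """Percentage of the stand-alone rate lost when all four CPUs run the program."""
    return 100.0 * (1.0 - simultaneous / stand_alone)


drops = {name: slowdown_percent(sa, sim) for name, (sa, sim) in CRAY2.items()}

# List the programs from the largest to the smallest loss.
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:8s} {drop:5.1f}% slower with four copies running")
```

On these figures the loss ranges from a few percent (PITEST, LLOOPS) to over 40 percent (NASKERN, MATEST, LES).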

In any event, it should be re-emphasized that the principal advantage of the Cray-2 is 
its very large memory, which allows jobs that previously could only be run using massive 
disk or solid state device I/O to now run in main memory. This is a MAJOR advantage, 
and it should not be allowed to be overshadowed by the slightly slower performance of the 
Cray-2 on some codes. 